29 research outputs found

    Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts

    Texts obtained from the web are noisy and do not necessarily follow orthographic rules for sentence and word boundaries. Thus, sentence segmentation and word tokenization systems developed on well-formed texts might not perform as well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries in an Estonian web dataset and then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza and UDPipe. While EstNLTK obtains the highest sentence segmentation performance on this dataset, the sentence segmentation performance of Stanza and UDPipe remains well below the results obtained on the more well-formed Estonian UD test set. Comment: BalticHLT202

    Minimally-Supervised Morphological Segmentation using Adaptor Grammars

    This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation. We compare three training methods: unsupervised training, semi-supervised training, and a novel model selection method. In the model selection method, we train unsupervised Adaptor Grammars using an over-articulated metagrammar, then use a small labelled data set to select which potential morph boundaries identified by the metagrammar should be returned in the final output. We evaluate on five languages and show that semi-supervised training provides a boost over unsupervised training, while the model selection method yields the best average results over all languages and is competitive with state-of-the-art semi-supervised systems. Moreover, this method provides the potential to tune performance according to different evaluation metrics or downstream tasks.

    Automatic prediction of depression and anxiety from spontaneous written language: data collection pilot study

    The aim of this study was to develop a method for collecting a textual dataset that could later be used to develop machine-learning-based methods for automatically assessing the risk of depression and anxiety. In the course of the work, a questionnaire was compiled and used to collect textual material from a convenience sample of nearly 300 volunteers. The collected texts included both a description of a given picture and a description of a freely chosen event or memory. The emotional state of the participants was measured with the EEK-2 screening test. Nearly 42% of the participants exceeded the risk threshold of the depression subscale and nearly 30% exceeded that of the anxiety subscale. Initial experiments with machine learning models that attempted to predict whether a person's EEK-2 score exceeds the depression and/or anxiety risk threshold did not give successful results. In summary, describing a given picture does not seem to be the most suitable way to collect the desired dataset; writing tasks that relate more closely to the person themselves should be used instead.

    Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

    We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy lemma characters from the external candidates supplied at run time. Our lemmatizer enhanced with candidates extracted from the Apertium morphological analyzer achieves statistically significant improvements over baseline models that do not use additional lemma information. It achieves an average accuracy of 97.25% on a set of 23 UD languages, which is 0.55% higher than that obtained with the Stanford Stanza model on the same set of languages. We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and that it achieves complementary improvements w.r.t. the data augmentation method.